Cox regression modeling of survival after chemotherapy for colon cancer

Author

Hermela Shimelis

Published

November 14, 2024

Data: Survival after chemotherapy for Stage B/C colon cancer [1] [2]

Description

These are data from one of the first successful trials of adjuvant chemotherapy for colon cancer. Levamisole is a low-toxicity compound previously used to treat worm infestations in animals; 5-FU is a moderately toxic (as these things go) chemotherapy agent. There are two records per person, one for recurrence and one for death.

The purpose of this project is to compare survival between the untreated (Obs) group vs those treated with amisole (Lev), or amisole + 5-FU.

Column names:

id: id
study: 1 for all patients
rx: Treatment - Obs(ervation), Lev(amisole), Lev(amisole)+5-FU
sex: 1=male
age: in years
obstruct: obstruction of colon by tumour
perfor: perforation of colon
adhere: adherence to nearby organs
nodes: number of lymph nodes with detectable cancer
time: days until event or censoring
status: censoring status
differ: differentiation of tumour (1=well, 2=moderate, 3=poor)
extent: Extent of local spread (1=submucosa, 2=muscle, 3=serosa, 4=contiguous structures)
surg: time from surgery to registration (0=short, 1=long)
node4: more than 4 positive lymph nodes
etype: event type: 1=recurrence,2=death
# Load library
library(dplyr)
library(survival)
library(janitor)
library(magrittr)
library(car)
library(ggplot2)
library(tidyverse)
library(broom)
library(MASS)
library(boot)

#print(citation("survival"), bibtex=TRUE)
#Load data
colon <- as_tibble(colon)
head(colon)
# A tibble: 6 × 16
     id study rx        sex   age obstruct perfor adhere nodes status differ
  <dbl> <dbl> <fct>   <dbl> <dbl>    <dbl>  <dbl>  <dbl> <dbl>  <dbl>  <dbl>
1     1     1 Lev+5FU     1    43        0      0      0     5      1      2
2     1     1 Lev+5FU     1    43        0      0      0     5      1      2
3     2     1 Lev+5FU     1    63        0      0      0     1      0      2
4     2     1 Lev+5FU     1    63        0      0      0     1      0      2
5     3     1 Obs         0    71        0      0      1     7      1      2
6     3     1 Obs         0    71        0      0      1     7      1      2
# ℹ 5 more variables: extent <dbl>, surg <dbl>, node4 <dbl>, time <dbl>,
#   etype <dbl>

Since the current analysis is focused on survival, filter data to death as the event type. This will create a data table with one row per individual.

colon_surv <- colon%>%filter(etype == 1) 

I. Exploratory data analysis

Check missing values

na_counts <- sapply(colon_surv, function(x)sum(is.na(x)))
na_counts
      id    study       rx      sex      age obstruct   perfor   adhere 
       0        0        0        0        0        0        0        0 
   nodes   status   differ   extent     surg    node4     time    etype 
      18        0       23        0        0        0        0        0 
# replace NAs with mode
table(colon_surv$differ)

  1   2   3 
 93 663 150 
mode(colon_surv$differ)
[1] "numeric"
median(colon_surv$nodes, na.rm= TRUE)
[1] 2
colon_surv$differ <- if_else(is.na(colon_surv$differ), 2,colon_surv$differ) 
colon_surv$nodes <- if_else(is.na(colon_surv$nodes), 2,colon_surv$nodes)

Insight: only nodes and differ columns have NA values. Replacing the 23 NAs in differ column with mode, and replace NAs in nodes with median.

Evaluate continuous variables

# age
hist(colon_surv$age)

hist(colon_surv$nodes)

hist(colon_surv$time)

Insight: Age is normally distributed. Number of nodes is skewed to the right. Time is fairly normally distributed with most the individuals having event time between 500-3000 days.

Evaluate nodes column to investigate outliers

t <- colon_surv%>%filter(node4 ==1) # samples with more than 4 positive lymph nodes
hist(t$nodes) 

Insight: samples with greater than 4 lymph nodes have less than 5 count in nodes column, so the two columns are not consistent. Therefore, nodes column will not be used for further analysis.

Evaluate categorical variables

summary_table <- colon_surv%>%summarise(count =n(),
                                        male = sum(sex), 
                                                       median_age = median(age),
                                                       ct_perforation = sum(perfor),
                                                       ct_adherence_nerby_organ = sum(adhere), censored = sum(status))

summary_table
# A tibble: 1 × 6
  count  male median_age ct_perforation ct_adherence_nerby_organ censored
  <int> <dbl>      <dbl>          <dbl>                    <dbl>    <dbl>
1   929   484         61             27                      135      468

Insight: Total number of participants: 929. About half of the participants are male and about half were censored, while the other half died.

# rename categorical columns to make them descriptive
colon_surv <- colon_surv%>%mutate(differentiation = case_when(differ == 1 ~ "well",
                                                              differ == 2 ~ "moderate",
                                                              differ == 3 ~ "poor"),
                                  local_spread = case_when(extent == 1 ~ "submucosa",
                                                           extent == 2 ~ "muscle",
                                                           extent == 3 ~ "serosa",
                                                           extent == 4 ~ "contiguous"),
                                  surg_to_reg_time = case_when(surg == 0~ "short",
                                                               surg == 1 ~ "long"))

Frequency tables for categorical variables

# frequency tables for categorical variables
# Tumor differentiation

colon_surv %>% 
  tabyl(differentiation, rx) %>%
  adorn_percentages("col") %>%
  adorn_pct_formatting(digits = 1) %>% 
  adorn_ns()
 differentiation         Obs         Lev     Lev+5FU
        moderate 74.9% (236) 73.9% (229) 72.7% (221)
            poor 16.5%  (52) 14.2%  (44) 17.8%  (54)
            well  8.6%  (27) 11.9%  (37)  9.5%  (29)
# extent of local spread
colon_surv %>% 
  tabyl(local_spread, rx) %>%
  adorn_percentages("col") %>%
  adorn_pct_formatting(digits = 1) %>% 
  adorn_ns()
 local_spread         Obs         Lev     Lev+5FU
   contiguous  6.3%  (20)  3.9%  (12)  3.6%  (11)
       muscle 12.1%  (38) 11.6%  (36) 10.5%  (32)
       serosa 79.0% (249) 83.5% (259) 82.6% (251)
    submucosa  2.5%   (8)  1.0%   (3)  3.3%  (10)
# colum obstruction
colon_surv %>% 
  tabyl(obstruct, rx) %>%
  adorn_percentages("col") %>%
  adorn_pct_formatting(digits = 1) %>% 
  adorn_ns()
 obstruct         Obs         Lev     Lev+5FU
        0 80.0% (252) 79.7% (247) 82.2% (250)
        1 20.0%  (63) 20.3%  (63) 17.8%  (54)
# colon perforation
colon_surv %>% 
  tabyl(perfor, rx) %>%
  adorn_percentages("col") %>%
  adorn_pct_formatting(digits = 1) %>% 
  adorn_ns()
 perfor         Obs         Lev     Lev+5FU
      0 97.1% (306) 96.8% (300) 97.4% (296)
      1  2.9%   (9)  3.2%  (10)  2.6%   (8)
# Adherance to nearby organs
colon_surv %>% 
  tabyl(adhere, rx) %>%
  adorn_percentages("col") %>%
  adorn_pct_formatting(digits = 1) %>% 
  adorn_ns()
 adhere         Obs         Lev     Lev+5FU
      0 85.1% (268) 84.2% (261) 87.2% (265)
      1 14.9%  (47) 15.8%  (49) 12.8%  (39)
# extent of local tumor spread
colon_surv %>% 
  tabyl(local_spread, rx) %>%
  adorn_percentages("col") %>%
  adorn_pct_formatting(digits = 1) %>% 
  adorn_ns()
 local_spread         Obs         Lev     Lev+5FU
   contiguous  6.3%  (20)  3.9%  (12)  3.6%  (11)
       muscle 12.1%  (38) 11.6%  (36) 10.5%  (32)
       serosa 79.0% (249) 83.5% (259) 82.6% (251)
    submucosa  2.5%   (8)  1.0%   (3)  3.3%  (10)
# More than 4 lymph nodes with cancer
colon_surv %>% 
  tabyl(node4, rx) %>%
  adorn_percentages("col") %>%
  adorn_pct_formatting(digits = 1) %>%
  adorn_ns()
 node4         Obs         Lev     Lev+5FU
     0 72.4% (228) 71.3% (221) 74.0% (225)
     1 27.6%  (87) 28.7%  (89) 26.0%  (79)
# time from surgery to registration
colon_surv %>% 
  tabyl(surg, rx) %>%
  adorn_percentages("col") %>%
  adorn_pct_formatting(digits = 1) %>% 
  adorn_ns()
 surg         Obs         Lev     Lev+5FU
    0 71.1% (224) 74.2% (230) 75.0% (228)
    1 28.9%  (91) 25.8%  (80) 25.0%  (76)

Summary statistics grouped by treatment

summary_table <- colon_surv%>%group_by(rx)%>%summarise(count =n(),
                                                       male = sum(sex),
                                                       median_age = median(age),
                                                       ct_perforation = sum(perfor),
                                                       ct_adherence_nerby_organ = sum(adhere),
                                                       perc_male = (male/count)*100,
                                                       iqr_age = IQR(age))
summary_table
# A tibble: 3 × 8
  rx      count  male median_age ct_perforation ct_adherence_nerby_o…¹ perc_male
  <fct>   <int> <dbl>      <dbl>          <dbl>                  <dbl>     <dbl>
1 Obs       315   166         60              9                     47      52.7
2 Lev       310   177         61             10                     49      57.1
3 Lev+5FU   304   141         62              8                     39      46.4
# ℹ abbreviated name: ¹​ct_adherence_nerby_organ
# ℹ 1 more variable: iqr_age <dbl>

Insight: Each treatment group had about 300 participants. Median age, number of participants with perforation and adherence are similar between the three groups.

II. Table 1: Description of the study population

Observation (%) Amisole (%) Amisole + 5-FU (%)
N=315 N=310 N=304
Demographics
Male 166 (52.3) 177 (57.1) 141
Median age (years) [IQR] 60 [53,68] 61 [53,69] 61 [52,70]
Cancer characteristics
Colon obstruction 63 (20.0) 63 (20.3) 54 (17.8)
Colon perforation 9 (2.9) 10 (3.2) 8 (2.6)
Adherence to nearby organs 47 (14.9) 49 (15.8) 39 (12.8)
Differentiation of tumor
Well 27 (8.6) 37 (11.9) 29 (9.5)
Moderate 236 (74.9) 229 (73.9) 221 (72.7)
Poor 52 (16.5) 44 (14.2) 54 (17.8)
Extent of local spread
Contiguous 20 (6.3) 12 (3.9) 11 (3.6)
Muscle 38 (12.1) 36 (11.6) 32 (10.5)
Serosa 249 (79.0) 259 (83.5) 251 (82.6)
Submucosa 8 (2.5) 3 (1.0) 10 (3.3)
More than 4 lymph nodes with cancer Yes 87 (27.6) 89 (28.7) 79 (26.0)
Short time from surgery to registration (%) Yes 91 (28.9) 80 (25.8) 76 (25.0)

III. Methods

The Cox proportional hazards model was used to model the relationship between survival time and different lung cancer treatments. In particlular the survival time will be compared between the untreated group (observation) vs. those treated with amisole (Lev), or amisole + 5-FU. The Cox regression model was chosen for this study because it is useful for studying association between survival time of patients and predictors and allows estimating the relative risk or hazard ratios due to the covariates, i.e., treatment status. The time (in days) until event, i.e, death, will be modeled as a function of treatment and other variables, including age, sex, and various tumor characteristics. Significant predictors were included in the final model.

Statistical analysis

The R statistical software version 4.3.2 [3] was used for all analysis. The Survival package was used to construct the Cox regression model [2] [1].

Cox regression model is based on the hazard function \(h_x(t)\) with covariates at time t given by:

\(h_x(t)=h_0(t)\exp(\beta_1x_1 +\beta_2x_2 + \dots + \beta_p x_p)\)

Where:

\(h_x(t)\) is the hazard function

\(h_0(t)\) is the baseline hazard function

\(\beta_1x_1 + \beta_2x_2 + \dots +\beta_p x_p\) represent the linear combination of covariates and their coefficient

The hazards for the observation vs. amisole (Lev), or amisole + 5-FU group with covariate values x1 and x2 are given by: \(hx_1(t)=h_0(t)\exp(\beta_1x_1)\) and \(hx_2(t)=h_0(t)\exp(\beta_2x_2)\), respectively

The hazard ratio is expressed as: HR = \(hx_2(t)\) / \(hx_1(t)\) = \(\exp[\beta(x_2-x_1)]\)

The Schoenfeld residual plot was constructed to test Cox proportional hazards assumption. When the proportional hazards assumpiton was not met for any of the covariates, stratification approach was explored. The Survminer [4] package was used to plot the Kaplan-Meier curve to visualize the survival probability over time for each treatment group.

Multicolinearity was tested using Variant Inflation Factor (VIF) calculated using MASS package [5].

The R MASS package was used for Stepwise model selection, using “both” forward and backward variable selection [5]. For Stepwise selection, stepAIC() function uses AIC (Akaike Information Criterion) as the measure to add or remove predictors from the model. Model performance was evaluated using 100-fold cross-validation using the boot package [6] [7].

IV. Analysis: Cox regression model

Survival curve

# Create new incremental count id
colon_surv$idcount <- c(1:length(colon_surv$id))

# Order by survival time and create an order variable:
colon_surv <- colon_surv[order(-colon_surv$time, colon_surv$status),]
colon_surv$order <- c(1:length(colon_surv$idcount))

ggplot(data=colon_surv, aes(x=time, y=order)) +
geom_rect(xmin=23,xmax=colon_surv$time,ymin=colon_surv$order,ymax=colon_surv$order+1, colour="lightgray") +
geom_rect(xmin=colon_surv$time-2,xmax=colon_surv$time,ymin=colon_surv$order,ymax=colon_surv$order+1,
          fill=factor(colon_surv$status+1)) +
geom_vline(xintercept= 1976,linetype="solid") +
scale_x_continuous(breaks=seq(20,3330,650)) +
geom_text(aes(2600, 750, label="Median Survival Time")) +
xlab("Survival Time (Days)") + ylab("Participants (ordered by survival time)") +
ggtitle("Survival Times for Participant") +
theme_classic() +
theme(legend.position="none",
      panel.grid.major=element_blank(),
      panel.grid.minor=element_blank(),
      panel.background=element_blank(),
      axis.line.x = element_line(color = "black"),
      axis.line.y = element_line(color = "black"))

Survival curve stratified by treatment group

library(survminer)
library(survival)

# Estimate the median survival time among the three groups
survfit(Surv(time,status) ~ rx, data = colon_surv)
Call: survfit(formula = Surv(time, status) ~ rx, data = colon_surv)

             n events median 0.95LCL 0.95UCL
rx=Obs     315    177   1236     803    2036
rx=Lev     310    172   1183     797    2067
rx=Lev+5FU 304    119     NA      NA      NA
# count the number of events after 2080 days, which is the median survival time among the observation group
tt <- colon_surv%>%filter(time > 2083)%>% group_by(rx)%>%summarise(ct = n(),
                                                                   death = sum(status))
# Plot survival curve
fit <- survfit(Surv(time,status) ~ rx, data = colon_surv)
ggsurvplot(fit, data=colon_surv, risk.table = TRUE)

# Estimate the probability of surviving beyond 3000 days
summary(survfit(Surv(time, status) ~ rx, data = colon_surv), times = 3000)
Call: survfit(formula = Surv(time, status) ~ rx, data = colon_surv)

                rx=Obs 
        time       n.risk      n.event     survival      std.err lower 95% CI 
    3.00e+03     5.00e+00     1.77e+02     4.07e-01     3.35e-02     3.47e-01 
upper 95% CI 
    4.79e-01 

                rx=Lev 
        time       n.risk      n.event     survival      std.err lower 95% CI 
    3.00e+03     4.00e+00     1.72e+02     4.33e-01     2.87e-02     3.80e-01 
upper 95% CI 
    4.93e-01 

                rx=Lev+5FU 
        time       n.risk      n.event     survival      std.err lower 95% CI 
    3.00e+03     7.00e+00     1.19e+02     5.99e-01     2.86e-02     5.46e-01 
upper 95% CI 
    6.58e-01 
# compare significant differences in survival times between the three groups
survdiff(Surv(time, status)~ rx, data = colon_surv)
Call:
survdiff(formula = Surv(time, status) ~ rx, data = colon_surv)

             N Observed Expected (O-E)^2/E (O-E)^2/V
rx=Obs     315      177      151      4.62      6.82
rx=Lev     310      172      149      3.69      5.41
rx=Lev+5FU 304      119      169     14.69     23.04

 Chisq= 23.1  on 2 degrees of freedom, p= 1e-05 

Insight: Based on the survival curve, the mediant survival time for the observation group is 2083 days. However, the median survival of Lev and Lev+5Fu group cannot be estimated, because there are too few events after 2083 days, which is the median survival time in the observation group.

The time for 50% survival probability of the group treated with Lev+5Fu is over 3000 days while the survival time for the observation and Lev group is around 2080 days. The probability of surviving to 3000 days among the Lev+5FU group is 56% (95% CI: 50-63), compared to 41% among the observation group.

The survival time is significantly different (P=0.003) between the three groups.

Insight: None of the variables have VIF values above 5, therefore there is no multicollinearity

Cox regression models

# Subset data for modeling
df <- colon_surv%>%dplyr::select(!c(id,study,etype,differ, extent,surg_to_reg_time, idcount, order, nodes))

Base Model

m0 <- coxph(Surv(time, status) ~ 1, data = df)
summary_m0 = summary(m0)

c_index_m0 <- concordance(m0)

cat("Concordance of the base model:",c_index_m0$concordance)
Concordance of the base model: 0.5

Univariate analysis

#| echo: true
#| message: false
#| warning: false

# Univariate analysis
m1 <- coxph(Surv(time, status) ~ rx, data = df)
summary(m1)
Call:
coxph(formula = Surv(time, status) ~ rx, data = df)

  n= 929, number of events= 468 

              coef exp(coef) se(coef)      z Pr(>|z|)    
rxLev     -0.01512   0.98499  0.10708 -0.141    0.888    
rxLev+5FU -0.51209   0.59924  0.11863 -4.317 1.58e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

          exp(coef) exp(-coef) lower .95 upper .95
rxLev        0.9850      1.015    0.7985    1.2150
rxLev+5FU    0.5992      1.669    0.4749    0.7561

Concordance= 0.554  (se = 0.013 )
Likelihood ratio test= 24.34  on 2 df,   p=5e-06
Wald test            = 22.58  on 2 df,   p=1e-05
Score (logrank) test = 23.07  on 2 df,   p=1e-05
c_index_m1 <- concordance(m1)
cat("Concordance of the univariate model:",c_index_m1$concordance)
Concordance of the univariate model: 0.5544598
anova(m0, m1) # Addition of rx variable significantly improved base model
Analysis of Deviance Table
 Cox model: response is  Surv(time, status)
 Model 1: ~ 1
 Model 2: ~ rx
   loglik  Chisq Df Pr(>|Chi|)    
1 -3040.3                         
2 -3028.1 24.343  2  5.175e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Insight: the coefficient of Lev is not significant, suggesting that there is no evidence that this treatment affects survival time compared to observation. however Lev+5Fu is significant (p=0.00175), indicating that the treatment Lev +5Fu has a statistically significant effect on survival time compared to the reference group. The negative sign indicates that this treatment group has a lower hazard and likely a longer survival time.

The hazard ratio for Lex+5FU (0.690), indicating the risk of death is about 31% lower compared to the observation group.

The p-values indicate that the model is significant.

Test the Cox proportional hazard assumption of m1

cox.zph(m1)
       chisq df    p
rx     0.301  2 0.86
GLOBAL 0.301  2 0.86
zph_test <- cox.zph(m1)

print(zph_test)
       chisq df    p
rx     0.301  2 0.86
GLOBAL 0.301  2 0.86
# plot the Schoenfeld residuals
plot(zph_test)

Insight: The Schoenfeld residal plot shows that the residuals are scattered randomly and the smooth trend line is horizontal near 0. This suggests that the hazard ratio for rx (treatment status) is constant over time and the proportional hazard assumption is met. The global p-value is >0.05, indicating that the the assumption is met.

Multivariate analysis

# Include all variables to determine which predictors are significant.

names(df)
 [1] "rx"              "sex"             "age"             "obstruct"       
 [5] "perfor"          "adhere"          "status"          "surg"           
 [9] "node4"           "time"            "differentiation" "local_spread"   
# multivariate analysis
m2 <- coxph(Surv(time, status) ~ rx+ age + sex + perfor + adhere + surg + obstruct + differentiation + node4+
              local_spread, data = df)  
summary(m2)
Call:
coxph(formula = Surv(time, status) ~ rx + age + sex + perfor + 
    adhere + surg + obstruct + differentiation + node4 + local_spread, 
    data = df)

  n= 929, number of events= 468 

                           coef exp(coef)  se(coef)      z Pr(>|z|)    
rxLev                 -0.000276  0.999724  0.108407 -0.003  0.99797    
rxLev+5FU             -0.513501  0.598397  0.119205 -4.308 1.65e-05 ***
age                   -0.002809  0.997195  0.003923 -0.716  0.47400    
sex                   -0.080319  0.922822  0.093672 -0.857  0.39120    
perfor                 0.188904  1.207925  0.255841  0.738  0.46029    
adhere                 0.168813  1.183898  0.128257  1.316  0.18811    
surg                   0.247327  1.280597  0.101407  2.439  0.01473 *  
obstruct               0.199681  1.221013  0.115875  1.723  0.08484 .  
differentiationpoor    0.338549  1.402910  0.121115  2.795  0.00519 ** 
differentiationwell    0.024189  1.024484  0.162583  0.149  0.88173    
node4                  0.842349  2.321814  0.097335  8.654  < 2e-16 ***
local_spreadmuscle    -1.048500  0.350463  0.259637 -4.038 5.38e-05 ***
local_spreadserosa    -0.478763  0.619549  0.199176 -2.404  0.01623 *  
local_spreadsubmucosa -1.113624  0.328367  0.491639 -2.265  0.02351 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

                      exp(coef) exp(-coef) lower .95 upper .95
rxLev                    0.9997     1.0003    0.8084    1.2364
rxLev+5FU                0.5984     1.6711    0.4737    0.7559
age                      0.9972     1.0028    0.9896    1.0049
sex                      0.9228     1.0836    0.7680    1.1088
perfor                   1.2079     0.8279    0.7316    1.9944
adhere                   1.1839     0.8447    0.9207    1.5223
surg                     1.2806     0.7809    1.0498    1.5622
obstruct                 1.2210     0.8190    0.9729    1.5323
differentiationpoor      1.4029     0.7128    1.1065    1.7788
differentiationwell      1.0245     0.9761    0.7449    1.4090
node4                    2.3218     0.4307    1.9186    2.8098
local_spreadmuscle       0.3505     2.8534    0.2107    0.5830
local_spreadserosa       0.6195     1.6141    0.4193    0.9154
local_spreadsubmucosa    0.3284     3.0454    0.1253    0.8607

Concordance= 0.669  (se = 0.012 )
Likelihood ratio test= 149.8  on 14 df,   p=<2e-16
Wald test            = 150.7  on 14 df,   p=<2e-16
Score (logrank) test = 160.4  on 14 df,   p=<2e-16
c_index_m2 <- concordance(m2)
cat("Concordance of the multivariate model:",c_index_m2$concordance)
Concordance of the multivariate model: 0.6694041
# Determine significant predictors
anova(m2)
Analysis of Deviance Table
 Cox model: response is Surv(time, status)
Terms added sequentially (first to last)

                 loglik   Chisq Df Pr(>|Chi|)    
NULL            -3040.3                          
rx              -3028.1 24.3435  2  5.175e-06 ***
age             -3026.8  2.5853  1  0.1078626    
sex             -3026.2  1.3125  1  0.2519380    
perfor          -3025.2  1.8472  1  0.1741088    
adhere          -3022.9  4.7199  1  0.0298156 *  
surg            -3020.3  5.1721  1  0.0229519 *  
obstruct        -3019.1  2.3959  1  0.1216554    
differentiation -3012.9 12.3380  2  0.0020933 ** 
node4           -2975.3 75.1393  1  < 2.2e-16 ***
local_spread    -2965.4 19.9397  3  0.0001747 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(m0, m1, m2)
Analysis of Deviance Table
 Cox model: response is  Surv(time, status)
 Model 1: ~ 1
 Model 2: ~ rx
 Model 3: ~ rx + age + sex + perfor + adhere + surg + obstruct + differentiation + node4 + local_spread
   loglik   Chisq Df Pr(>|Chi|)    
1 -3040.3                          
2 -3028.1  24.343  2  5.175e-06 ***
3 -2965.4 125.450 12  < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Insight: When all variables are included in the model, the anova test indicates that rx, adhere, surg, obstruct, differentiation, node4 and local spread are significant predictors. Additionally, model concordance did not improve when removing predictors that were not significant in m2

The concordance of the multivariable model, 0.674, is higher than the univariate model (m1, concordance =0.53), suggesting that the multivariate model is a better fit model.

Evaluate multicollinearity using Variance Inflation Factor (VIF)

vif <- vif(m2)
print(vif)
                    GVIF Df GVIF^(1/(2*Df))
rx              1.033078  2        1.008169
age             1.036544  1        1.018108
sex             1.025735  1        1.012786
perfor          1.071324  1        1.035048
adhere          1.111746  1        1.054393
surg            1.012413  1        1.006187
obstruct        1.047540  1        1.023494
differentiation 1.052692  2        1.012920
node4           1.035800  1        1.017742
local_spread    1.107687  3        1.017192

Evaluate significance of predictors. Model survival while including different cancer characteristics as predictors separately to identify significance predictors.

# model including all variables
m2 <- coxph(Surv(time, status) ~ rx+ age + sex + perfor + adhere + surg + obstruct + differentiation + node4+
              local_spread, data = df)  


# Treatment
m2a <- coxph(Surv(time, status) ~ rx, data = df) # significant
summary(m2a)
Call:
coxph(formula = Surv(time, status) ~ rx, data = df)

  n= 929, number of events= 468 

              coef exp(coef) se(coef)      z Pr(>|z|)    
rxLev     -0.01512   0.98499  0.10708 -0.141    0.888    
rxLev+5FU -0.51209   0.59924  0.11863 -4.317 1.58e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

          exp(coef) exp(-coef) lower .95 upper .95
rxLev        0.9850      1.015    0.7985    1.2150
rxLev+5FU    0.5992      1.669    0.4749    0.7561

Concordance= 0.554  (se = 0.013 )
Likelihood ratio test= 24.34  on 2 df,   p=5e-06
Wald test            = 22.58  on 2 df,   p=1e-05
Score (logrank) test = 23.07  on 2 df,   p=1e-05
# Demographics
m2b <- coxph(Surv(time, status) ~ age + sex, data = df) # not significant
summary(m2b)
Call:
coxph(formula = Surv(time, status) ~ age + sex, data = df)

  n= 929, number of events= 468 

         coef exp(coef)  se(coef)      z Pr(>|z|)  
age -0.006715  0.993308  0.003880 -1.731   0.0835 .
sex -0.084298  0.919157  0.092483 -0.911   0.3620  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    exp(coef) exp(-coef) lower .95 upper .95
age    0.9933      1.007    0.9858     1.001
sex    0.9192      1.088    0.7668     1.102

Concordance= 0.527  (se = 0.014 )
Likelihood ratio test= 3.77  on 2 df,   p=0.2
Wald test            = 3.82  on 2 df,   p=0.1
Score (logrank) test = 3.82  on 2 df,   p=0.1
# cancer characteristics
m2c <- coxph(Surv(time, status) ~ perfor + adhere + obstruct, data = df) # adhere and obstruct are significant
summary(m2c)
Call:
coxph(formula = Surv(time, status) ~ perfor + adhere + obstruct, 
    data = df)

  n= 929, number of events= 468 

           coef exp(coef) se(coef)     z Pr(>|z|)  
perfor   0.1729    1.1888   0.2554 0.677   0.4983  
adhere   0.2923    1.3395   0.1242 2.353   0.0186 *
obstruct 0.2194    1.2454   0.1146 1.914   0.0556 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

         exp(coef) exp(-coef) lower .95 upper .95
perfor       1.189     0.8412    0.7207     1.961
adhere       1.340     0.7465    1.0500     1.709
obstruct     1.245     0.8030    0.9948     1.559

Concordance= 0.539  (se = 0.011 )
Likelihood ratio test= 10.61  on 3 df,   p=0.01
Wald test            = 11.4  on 3 df,   p=0.01
Score (logrank) test = 11.49  on 3 df,   p=0.009
# Differentiation of tumor
m2d <- coxph(Surv(time, status) ~ differentiation, data = df) # significant
summary(m2d)
Call:
coxph(formula = Surv(time, status) ~ differentiation, data = df)

  n= 929, number of events= 468 

                       coef exp(coef) se(coef)      z Pr(>|z|)    
differentiationpoor  0.4460    1.5621   0.1199  3.721 0.000198 ***
differentiationwell -0.0523    0.9490   0.1603 -0.326 0.744262    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

                    exp(coef) exp(-coef) lower .95 upper .95
differentiationpoor     1.562     0.6402    1.2350     1.976
differentiationwell     0.949     1.0537    0.6931     1.299

Concordance= 0.541  (se = 0.011 )
Likelihood ratio test= 13.35  on 2 df,   p=0.001
Wald test            = 14.68  on 2 df,   p=6e-04
Score (logrank) test = 14.93  on 2 df,   p=6e-04
# Extent of local spread
m2e <- coxph(Surv(time, status) ~ local_spread, data = df) # significant
summary(m2e)
Call:
coxph(formula = Surv(time, status) ~ local_spread, data = df)

  n= 929, number of events= 468 

                         coef exp(coef) se(coef)      z Pr(>|z|)    
local_spreadmuscle    -1.2391    0.2896   0.2533 -4.891    1e-06 ***
local_spreadserosa    -0.5663    0.5676   0.1927 -2.939  0.00329 ** 
local_spreadsubmucosa -1.5668    0.2087   0.4846 -3.233  0.00122 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

                      exp(coef) exp(-coef) lower .95 upper .95
local_spreadmuscle       0.2896      3.453   0.17628    0.4759
local_spreadserosa       0.5676      1.762   0.38909    0.8281
local_spreadsubmucosa    0.2087      4.791   0.08074    0.5395

Concordance= 0.551  (se = 0.009 )
Likelihood ratio test= 32.65  on 3 df,   p=4e-07
Wald test            = 29.3  on 3 df,   p=2e-06
Score (logrank) test = 30.96  on 3 df,   p=9e-07
# short time from surgery to registration
m2f <- coxph(Surv(time, status) ~ surg, data = df) # significant
summary(m2f)
Call:
coxph(formula = Surv(time, status) ~ surg, data = df)

  n= 929, number of events= 468 

       coef exp(coef) se(coef)     z Pr(>|z|)  
surg 0.2549    1.2903   0.1008 2.529   0.0114 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

     exp(coef) exp(-coef) lower .95 upper .95
surg      1.29      0.775     1.059     1.572

Concordance= 0.523  (se = 0.01 )
Likelihood ratio test= 6.17  on 1 df,   p=0.01
Wald test            = 6.39  on 1 df,   p=0.01
Score (logrank) test = 6.43  on 1 df,   p=0.01
# include predictors significant in the model which included all predictors (m2)
m3 <- coxph(Surv(time, status) ~ rx + surg + obstruct + differentiation + node4
              + local_spread, data = df)

summary(m3)
Call:
coxph(formula = Surv(time, status) ~ rx + surg + obstruct + differentiation + 
    node4 + local_spread, data = df)

  n= 929, number of events= 468 

                            coef  exp(coef)   se(coef)      z Pr(>|z|)    
rxLev                  0.0007498  1.0007501  0.1081783  0.007  0.99447    
rxLev+5FU             -0.5078980  0.6017591  0.1192067 -4.261 2.04e-05 ***
surg                   0.2513711  1.2857872  0.1013684  2.480  0.01315 *  
obstruct               0.2215931  1.2480634  0.1142954  1.939  0.05253 .  
differentiationpoor    0.3531967  1.4236111  0.1205339  2.930  0.00339 ** 
differentiationwell    0.0340964  1.0346843  0.1617970  0.211  0.83309    
node4                  0.8455285  2.3292085  0.0967209  8.742  < 2e-16 ***
local_spreadmuscle    -1.1345405  0.3215698  0.2552751 -4.444 8.81e-06 ***
local_spreadserosa    -0.5565149  0.5732032  0.1935786 -2.875  0.00404 ** 
local_spreadsubmucosa -1.1924159  0.3034872  0.4872601 -2.447  0.01440 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

                      exp(coef) exp(-coef) lower .95 upper .95
rxLev                    1.0008     0.9993    0.8096    1.2371
rxLev+5FU                0.6018     1.6618    0.4764    0.7601
surg                     1.2858     0.7777    1.0541    1.5684
obstruct                 1.2481     0.8012    0.9976    1.5614
differentiationpoor      1.4236     0.7024    1.1241    1.8030
differentiationwell      1.0347     0.9665    0.7535    1.4208
node4                    2.3292     0.4293    1.9270    2.8154
local_spreadmuscle       0.3216     3.1097    0.1950    0.5304
local_spreadserosa       0.5732     1.7446    0.3922    0.8377
local_spreadsubmucosa    0.3035     3.2950    0.1168    0.7887

Concordance= 0.667  (se = 0.012 )
Likelihood ratio test= 146.1  on 10 df,   p=<2e-16
Wald test            = 147.8  on 10 df,   p=<2e-16
Score (logrank) test = 157  on 10 df,   p=<2e-16
c_index_m3 <- concordance(m3)
cat("Concordance of the multivariate model2:",c_index_m3$concordance)
Concordance of the multivariate model2: 0.6667876
anova(m0, m1, m2, m2a, m2b, m2c, m2d, m2e, m2f, m3)
Analysis of Deviance Table
 Cox model: response is  Surv(time, status)
 Model  1: ~ 1
 Model  2: ~ rx
 Model  3: ~ rx + age + sex + perfor + adhere + surg + obstruct + differentiation + node4 + local_spread
 Model  4: ~ rx
 Model  5: ~ age + sex
 Model  6: ~ perfor + adhere + obstruct
 Model  7: ~ differentiation
 Model  8: ~ local_spread
 Model  9: ~ surg
 Model 10: ~ rx + surg + obstruct + differentiation + node4 + local_spread
    loglik    Chisq Df Pr(>|Chi|)    
1  -3040.3                           
2  -3028.1  24.3435  2  5.175e-06 ***
3  -2965.4 125.4499 12  < 2.2e-16 ***
4  -3028.1 125.4499 12  < 2.2e-16 ***
5  -3038.4  20.5731  0  < 2.2e-16 ***
6  -3035.0   6.8362  1   0.008933 ** 
7  -3033.6   2.7482  1   0.097367 .  
8  -3023.9  19.2935  1  1.121e-05 ***
9  -3037.2  26.4771  2  1.781e-06 ***
10 -2967.2 139.8791  9  < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Insight: rx, adhere, surg, obstruct, differentiation and local_spread are significant predictors. However, the model concordance is low (~0.5) when each was included separately. Model m2, which included all predictors, followed by model 3 (including selected significant predictors) have the highest concordance.

Perform Stepwise variable selection:

library(MASS)       # for stepwise regression
#### Use the MASS package stepAIC() function for stepwise selection by using AIC (Akaike Information Criterion) as the measure to add or remove predictors from the model.

# model including all variables
m2 <- coxph(Surv(time, status) ~ rx+ age + sex + perfor + adhere + surg + obstruct + differentiation + node4+
              local_spread, data = df)  

# stepwise selection
stepwise_model <- stepAIC(m2, direction = "both")
Start:  AIC=5958.76
Surv(time, status) ~ rx + age + sex + perfor + adhere + surg + 
    obstruct + differentiation + node4 + local_spread

                  Df    AIC
- age              1 5957.3
- perfor           1 5957.3
- sex              1 5957.5
- adhere           1 5958.4
<none>               5958.8
- obstruct         1 5959.6
- differentiation  2 5962.1
- surg             1 5962.5
- local_spread     3 5972.7
- rx               2 5979.7
- node4            1 6026.0

Step:  AIC=5957.27
Surv(time, status) ~ rx + sex + perfor + adhere + surg + obstruct + 
    differentiation + node4 + local_spread

                  Df    AIC
- perfor           1 5955.9
- sex              1 5956.0
- adhere           1 5956.8
<none>               5957.3
- obstruct         1 5958.4
+ age              1 5958.8
- differentiation  2 5960.5
- surg             1 5961.0
- local_spread     3 5971.5
- rx               2 5978.2
- node4            1 6026.0

Step:  AIC=5955.87
Surv(time, status) ~ rx + sex + adhere + surg + obstruct + differentiation + 
    node4 + local_spread

                  Df    AIC
- sex              1 5954.5
- adhere           1 5955.8
<none>               5955.9
+ perfor           1 5957.3
+ age              1 5957.3
- obstruct         1 5957.4
- differentiation  2 5959.2
- surg             1 5959.6
- local_spread     3 5970.6
- rx               2 5976.7
- node4            1 6024.2

Step:  AIC=5954.51
Surv(time, status) ~ rx + adhere + surg + obstruct + differentiation + 
    node4 + local_spread

                  Df    AIC
- adhere           1 5954.5
<none>               5954.5
+ sex              1 5955.9
+ age              1 5955.9
+ perfor           1 5956.0
- obstruct         1 5956.2
- differentiation  2 5957.8
- surg             1 5958.2
- local_spread     3 5969.3
- rx               2 5975.0
- node4            1 6023.5

Step:  AIC=5954.5
Surv(time, status) ~ rx + surg + obstruct + differentiation + 
    node4 + local_spread

                  Df    AIC
<none>               5954.5
+ adhere           1 5954.5
+ perfor           1 5955.6
+ sex              1 5955.8
+ age              1 5956.1
- obstruct         1 5956.1
- surg             1 5958.4
- differentiation  2 5958.5
- local_spread     3 5971.1
- rx               2 5975.0
- node4            1 6022.9
summary(stepwise_model)
Call:
coxph(formula = Surv(time, status) ~ rx + surg + obstruct + differentiation + 
    node4 + local_spread, data = df)

  n= 929, number of events= 468 

                            coef  exp(coef)   se(coef)      z Pr(>|z|)    
rxLev                  0.0007498  1.0007501  0.1081783  0.007  0.99447    
rxLev+5FU             -0.5078980  0.6017591  0.1192067 -4.261 2.04e-05 ***
surg                   0.2513711  1.2857872  0.1013684  2.480  0.01315 *  
obstruct               0.2215931  1.2480634  0.1142954  1.939  0.05253 .  
differentiationpoor    0.3531967  1.4236111  0.1205339  2.930  0.00339 ** 
differentiationwell    0.0340964  1.0346843  0.1617970  0.211  0.83309    
node4                  0.8455285  2.3292085  0.0967209  8.742  < 2e-16 ***
local_spreadmuscle    -1.1345405  0.3215698  0.2552751 -4.444 8.81e-06 ***
local_spreadserosa    -0.5565149  0.5732032  0.1935786 -2.875  0.00404 ** 
local_spreadsubmucosa -1.1924159  0.3034872  0.4872601 -2.447  0.01440 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

                      exp(coef) exp(-coef) lower .95 upper .95
rxLev                    1.0008     0.9993    0.8096    1.2371
rxLev+5FU                0.6018     1.6618    0.4764    0.7601
surg                     1.2858     0.7777    1.0541    1.5684
obstruct                 1.2481     0.8012    0.9976    1.5614
differentiationpoor      1.4236     0.7024    1.1241    1.8030
differentiationwell      1.0347     0.9665    0.7535    1.4208
node4                    2.3292     0.4293    1.9270    2.8154
local_spreadmuscle       0.3216     3.1097    0.1950    0.5304
local_spreadserosa       0.5732     1.7446    0.3922    0.8377
local_spreadsubmucosa    0.3035     3.2950    0.1168    0.7887

Concordance= 0.667  (se = 0.012 )
Likelihood ratio test= 146.1  on 10 df,   p=<2e-16
Wald test            = 147.8  on 10 df,   p=<2e-16
Score (logrank) test = 157  on 10 df,   p=<2e-16
# Multivariate model including variables selected based on stepwise variable selection. The same variables were significant based on anova test of the model that included all variables.

m4 <- coxph(Surv(time, status) ~ rx + age + surg + obstruct + 
    differentiation + node4 + local_spread, data = df)
summary(m4)
Call:
coxph(formula = Surv(time, status) ~ rx + age + surg + obstruct + 
    differentiation + node4 + local_spread, data = df)

  n= 929, number of events= 468 

                           coef exp(coef)  se(coef)      z Pr(>|z|)    
rxLev                  0.001883  1.001885  0.108199  0.017  0.98612    
rxLev+5FU             -0.507444  0.602033  0.119200 -4.257 2.07e-05 ***
age                   -0.002617  0.997387  0.003923 -0.667  0.50479    
surg                   0.251935  1.286513  0.101365  2.485  0.01294 *  
obstruct               0.213610  1.238140  0.114930  1.859  0.06308 .  
differentiationpoor    0.355638  1.427091  0.120577  2.949  0.00318 ** 
differentiationwell    0.031717  1.032226  0.161856  0.196  0.84464    
node4                  0.839740  2.315764  0.097116  8.647  < 2e-16 ***
local_spreadmuscle    -1.125910  0.324357  0.255575 -4.405 1.06e-05 ***
local_spreadserosa    -0.551200  0.576258  0.193725 -2.845  0.00444 ** 
local_spreadsubmucosa -1.197289  0.302012  0.487303 -2.457  0.01401 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

                      exp(coef) exp(-coef) lower .95 upper .95
rxLev                    1.0019     0.9981    0.8104    1.2386
rxLev+5FU                0.6020     1.6610    0.4766    0.7605
age                      0.9974     1.0026    0.9897    1.0051
surg                     1.2865     0.7773    1.0547    1.5693
obstruct                 1.2381     0.8077    0.9884    1.5510
differentiationpoor      1.4271     0.7007    1.1267    1.8075
differentiationwell      1.0322     0.9688    0.7516    1.4176
node4                    2.3158     0.4318    1.9144    2.8013
local_spreadmuscle       0.3244     3.0830    0.1966    0.5353
local_spreadserosa       0.5763     1.7353    0.3942    0.8424
local_spreadsubmucosa    0.3020     3.3111    0.1162    0.7849

Concordance= 0.667  (se = 0.012 )
Likelihood ratio test= 146.5  on 11 df,   p=<2e-16
Wald test            = 147.9  on 11 df,   p=<2e-16
Score (logrank) test = 157.4  on 11 df,   p=<2e-16
anova(m4)
Analysis of Deviance Table
 Cox model: response is Surv(time, status)
Terms added sequentially (first to last)

                 loglik   Chisq Df Pr(>|Chi|)    
NULL            -3040.3                          
rx              -3028.1 24.3435  2  5.175e-06 ***
age             -3026.8  2.5853  1  0.1078626    
surg            -3024.2  5.2257  1  0.0222552 *  
obstruct        -3022.7  3.0968  1  0.0784487 .  
differentiation -3015.6 14.1005  2  0.0008672 ***
node4           -2978.2 74.7933  1  < 2.2e-16 ***
local_spread    -2967.0 22.3477  3  5.522e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
cox_summary <- tidy(m4)

cox_summary
# A tibble: 11 × 5
   term                  estimate std.error statistic  p.value
   <chr>                    <dbl>     <dbl>     <dbl>    <dbl>
 1 rxLev                  0.00188   0.108      0.0174 9.86e- 1
 2 rxLev+5FU             -0.507     0.119     -4.26   2.07e- 5
 3 age                   -0.00262   0.00392   -0.667  5.05e- 1
 4 surg                   0.252     0.101      2.49   1.29e- 2
 5 obstruct               0.214     0.115      1.86   6.31e- 2
 6 differentiationpoor    0.356     0.121      2.95   3.18e- 3
 7 differentiationwell    0.0317    0.162      0.196  8.45e- 1
 8 node4                  0.840     0.0971     8.65   5.30e-18
 9 local_spreadmuscle    -1.13      0.256     -4.41   1.06e- 5
10 local_spreadserosa    -0.551     0.194     -2.85   4.44e- 3
11 local_spreadsubmucosa -1.20      0.487     -2.46   1.40e- 2
c_index_m4 <- concordance(m4)
cat("Concordance of the model with multivariate stepwise v_select:",c_index_m4$concordance)
Concordance of the model with multivariate stepwise v_select: 0.6670357

Test whether proportional hazard assumptions are met for model 4 predictors

cox.zph(m4) # final model with stepwise variable selection
                  chisq df       p
rx               0.0758  2 0.96283
age              0.0400  1 0.84140
surg             1.0410  1 0.30758
obstruct         4.7833  1 0.02874
differentiation 18.8075  2 8.2e-05
node4           10.0945  1 0.00149
local_spread     1.7889  3 0.61735
GLOBAL          37.2172 11 0.00011
zph_test <- cox.zph(m4)

print(zph_test)
                  chisq df       p
rx               0.0758  2 0.96283
age              0.0400  1 0.84140
surg             1.0410  1 0.30758
obstruct         4.7833  1 0.02874
differentiation 18.8075  2 8.2e-05
node4           10.0945  1 0.00149
local_spread     1.7889  3 0.61735
GLOBAL          37.2172 11 0.00011
# plot the Schoenfeld residuals
plot(zph_test)

Insight: Differentiation, node4 and obstruct variables did not meet proportional hazards assumption.

Stratify model by variables violating roportional hazard assumption

m5 <- coxph(Surv(time, status) ~ rx + age + surg + strata(obstruct) + strata(differentiation) + node4 +
              local_spread, data = df)

summary(m5)
Call:
coxph(formula = Surv(time, status) ~ rx + age + surg + strata(obstruct) + 
    strata(differentiation) + node4 + local_spread, data = df)

  n= 929, number of events= 468 

                           coef exp(coef)  se(coef)      z Pr(>|z|)    
rxLev                 -0.002209  0.997793  0.108448 -0.020  0.98375    
rxLev+5FU             -0.494982  0.609582  0.119133 -4.155 3.25e-05 ***
age                   -0.001895  0.998107  0.003932 -0.482  0.62989    
surg                   0.263478  1.301449  0.101411  2.598  0.00937 ** 
node4                  0.819875  2.270215  0.097645  8.396  < 2e-16 ***
local_spreadmuscle    -1.160631  0.313288  0.256066 -4.533 5.83e-06 ***
local_spreadserosa    -0.566254  0.567648  0.194214 -2.916  0.00355 ** 
local_spreadsubmucosa -1.268435  0.281272  0.487704 -2.601  0.00930 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

                      exp(coef) exp(-coef) lower .95 upper .95
rxLev                    0.9978     1.0022    0.8067    1.2341
rxLev+5FU                0.6096     1.6405    0.4826    0.7699
age                      0.9981     1.0019    0.9904    1.0058
surg                     1.3014     0.7684    1.0669    1.5876
node4                    2.2702     0.4405    1.8748    2.7491
local_spreadmuscle       0.3133     3.1919    0.1897    0.5175
local_spreadserosa       0.5676     1.7617    0.3879    0.8306
local_spreadsubmucosa    0.2813     3.5553    0.1081    0.7316

Concordance= 0.672  (se = 0.014 )
Likelihood ratio test= 125.7  on 8 df,   p=<2e-16
Wald test            = 127.1  on 8 df,   p=<2e-16
Score (logrank) test = 134  on 8 df,   p=<2e-16
cox.zph(m5) # final model with stratification by variables violating proportional hazard assumption
               chisq df      p
rx            0.2291  2 0.8918
age           0.0102  1 0.9194
surg          1.4847  1 0.2230
node4         8.6990  1 0.0032
local_spread  2.0043  3 0.5715
GLOBAL       12.8524  8 0.1170
zph_test <- cox.zph(m5)

print(zph_test)
               chisq df      p
rx            0.2291  2 0.8918
age           0.0102  1 0.9194
surg          1.4847  1 0.2230
node4         8.6990  1 0.0032
local_spread  2.0043  3 0.5715
GLOBAL       12.8524  8 0.1170
# plot the Schoenfeld residuals
plot(zph_test)

Insight: After model stratification by obstruct and differentiation, the proportional hazard assumption is met as the global p >0.05. Node4 slightly violates assumption, but the final model is not stratified by node 4 because the model concordance is attenuated when stratifying by node4.

Model comparision

library(knitr)

models <- list(m0, m1, m2, m3, m4, m5)

# Add descriptions for each model
descriptions <- c(
  "Model 0 - Base model",
  "Model 1 - Treatment",
  "Model 2 - Full variables",
  "Model 3 - Significant predictors in model 2",
  "Model 4 - Selected stepwisely",
  "Model 5 - model 4 Stratified"
)

# create a data frame to store results
results <- data.frame(Model = character(),
                      Description = character(),
                      AIC = numeric(),
                      BIC = numeric(),
                      C_Index = numeric(),
                      stringsAsFactors = FALSE)

# function to calculate and store metrics for each model
for (i in seq_along(models)) {
  model <- models[[i]]
  
  # Extract AIC and BIC
  aic <- AIC(model)
  bic <- BIC(model)
  
  # add C-index
  c_index <- concordance(model)$concordance
  
  # append results to the data frame
  results <- rbind(results, data.frame(
    Model_variables = descriptions[i],
    AIC = aic,
    BIC = bic,
    C_Index = round(c_index,3)
  ))
}

kable(results,caption = "Model evaluation matrics")
Model evaluation matrics
Model_variables AIC BIC C_Index
Model 0 - Base model 6080.552 6080.552 0.500
Model 1 - Treatment 6060.208 6068.505 0.554
Model 2 - Full variables 5958.758 6016.837 0.669
Model 3 - Significant predictors in model 2 5954.501 5995.986 0.667
Model 4 - Selected stepwisely 5956.059 6001.692 0.667
Model 5 - model 4 Stratified 4757.394 4790.582 0.672

K-fold cross validation

library(survival)
library(boot)  # for bootstrapping
library(survcomp)  # to calculate c-index
library(caret)


set.seed(1234)

# Cox model 
cox_model <- coxph(Surv(time, status) ~ rx + age + surg + strata(obstruct) + strata(differentiation) + node4 +
              local_spread, data = df)


# calculate the original c-index
c_index_original <- concordance(cox_model)
cat("original c-index:", c_index_original$concordance, "\n")
original c-index: 0.6718827 
# create a function for calculating c-index in each fold using concordance()
cox_cindex <- function(train_data, test_data) {
  fit <- coxph(Surv(time, status) ~ rx + age + surg + strata(obstruct) + strata(differentiation) + node4 + local_spread, data = train_data)
  # Calculate concordance on test data
  c_index <- concordance(fit, newdata = test_data)$concordance
  
  return(c_index)
}

# perform 5-fold cross-validation with stratification
K <- 5
folds <- createFolds(c(df$status, df$differentiation, df$rx), k = K, list = TRUE, returnTrain = TRUE)
cv_c_indices <- sapply(folds, function(train_indices) {
  train_data <- df[train_indices, ]
  test_data <- df[-train_indices, ]
  cox_cindex(train_data, test_data) # use the concordance function inside cox_cindex
})

# Print cross-validated c-indices
print(cv_c_indices)
    Fold1     Fold2     Fold3     Fold4     Fold5 
0.7012768 0.6365716 0.6360538 0.6957962 0.6648189 
# cross-validation c-indices
cat("cross-validated c-Indices for each fold:", cv_c_indices, "\n")
cross-validated c-Indices for each fold: 0.7012768 0.6365716 0.6360538 0.6957962 0.6648189 
cat("mean cross-validated c-Index:", mean(cv_c_indices), "\n")
mean cross-validated c-Index: 0.6669035 
# plot cross-validation c-indices
plot(cv_c_indices, type = "b", xlab = "Fold", ylab = "c-index", main = "c-index across folds")

Insight: The original model c-index (0.674) and mean cross-validation c-index (0.675) is very similar, suggesting the the final stratified model is stable and is not overfitting.

V. Results

Table 2. Univariate model: Survival after Chemotherapy for stage B/C Colon Cancer

Treatment Coefficient Hazard ratio 95% CI_upper 95% CI_lower P-value
Amisole (Lev) -0.027 0.974 0.784 1.209 0.809
Amisole + 5-FU -0.372 0.690 0.546 0.870 0.002

Table 3. Multivariate model: Survival after Chemotherapy for stage B/C Colon Cancer

Treatment Coefficient Hazard ratio 95% CI_upper 95% CI_lower P-value
Amisole (Lev) -0.011 0.989 0.795 1.231 0.923
Amisole + 5-FU -0.376 0.687 0.543 0.868 0.002
Age 0.007 1.007 0.999 1.015 0.069
Surge 0.244 1.276 1.042 1.562 0.018
Obstruction of colon 0.283 1.327 1.057 1.667 0.015
Differentiat ion_poor 0.374 1.453 1.145 1.844 0.002
Differentiat ion_well 0.069 1.072 0.774 1.483 0.677
More than 4 nodes (+) 0.930 2.534 2.089 3.074 3.75 x 10-21
Local spread_muscle -0.996 0.370 0.225 0.606 7.85 x 10-5
Local spread_serosa -0.501 0.606 0.414 0.886 0.010
Local spread_submucosa -1.322 0.267 0.093 0.763 0.014

References

[1]
Terry M. Therneau and Patricia M. Grambsch, Modeling survival data: Extending the Cox model. New York: Springer, 2000.
[2]
T. M. Therneau, A package for survival analysis in r. 2024. Available: https://CRAN.R-project.org/package=survival
[3]
R Core Team, R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing, 2023. Available: https://www.R-project.org/
[4]
A. Kassambara, M. Kosinski, and P. Biecek, Survminer: Drawing survival curves using ’ggplot2’. 2021. Available: https://CRAN.R-project.org/package=survminer
[5]
W. N. Venables and B. D. Ripley, Modern applied statistics with s, Fourth. New York: Springer, 2002. Available: https://www.stats.ox.ac.uk/pub/MASS4/
[6]
A. C. Davison and D. V. Hinkley, Bootstrap methods and their applications. Cambridge: Cambridge University Press, 1997. Available: doi:10.1017/CBO9780511802843
[7]
Angelo Canty and B. D. Ripley, Boot: Bootstrap r (s-plus) functions. 2024.